Here is an Exploratory Data Analysis for the Google Analytics Customer Revenue Prediction competition, done in the R environment. For this EDA we will mainly use the following packages:
For modelling we will use the following packages:
Our task is to build an algorithm that predicts the natural log of the sum of all transactions per user. Thus, for every user in the test set, the target is:
\[y_{user} = \sum_{i=1}^n transaction_{user_i}\]
\[target_{user} = \ln(y_{user}+1).\]
Submissions are scored on the root mean squared error, which is defined as:
\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2},\]
where \(\widehat{y}\) is the predicted revenue for a customer and \(y\) is the natural log of the actual revenue value.
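Before loading anything, it may help to see the target construction and the metric in base R. The numbers below are made up for illustration only:

```r
# Toy sessions: a user id and per-session revenue (hypothetical values)
sessions <- data.frame(
  user    = c("a", "a", "b", "c"),
  revenue = c(100, 50, 0, 30)
)

# y_user: sum of all transactions per user; target: ln(y_user + 1)
y_user <- tapply(sessions$revenue, sessions$user, sum)
target <- log1p(y_user)

# RMSE between the log-target and a prediction
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
rmse(target, target)  # a perfect prediction scores 0
```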
Let’s prepare and have a look at the dataset.
library(h2o)
library(caret)
library(lme4)
library(ggalluvial)
library(xgboost)
library(jsonlite)
library(lubridate)
library(knitr)
library(Rmisc)
library(scales)
library(countrycode)
library(highcharter)
library(glmnet)
library(keras)
library(forecast)
library(zoo)
library(magrittr)
library(tidyverse)

set.seed(0)
tr <- read_csv("./data/train.csv")
te <- read_csv("./data/test.csv")
subm <- read_csv("./data/sample_submission.csv")

The train set has 903,653 observations and the test set 804,684 observations, each with 12 variables.
As the figure shows, the train set contains only a few transactions after July 2017; the rest fall into the test period. It therefore makes sense to create time-based splits for the train/validation sets.
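A minimal sketch of such a time-based split (the dates and the cutoff below are assumptions for illustration; base as.Date is used here for self-containment, while this kernel works with lubridate::ymd()):

```r
# Hypothetical session dates spanning the train period
dates <- as.Date(c("2016-09-01", "2017-01-15", "2017-06-30", "2017-07-15"))

# Sessions before the cutoff form the training split, the rest validation
cutoff <- as.Date("2017-07-01")
idx_train <- dates < cutoff

sum(idx_train)   # sessions in the training split
sum(!idx_train)  # sessions in the validation split
```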
There is a total of 12 features:
Let’s have a look at counts of the simple features:
The attribute socialEngagementType can be removed, as it has only one value.
The columns device, geoNetwork, trafficSource and totals actually contain data in JSON format. To parse it we can use the jsonlite package:
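As a toy illustration of the approach (the JSON strings below are made up), concatenating the per-row JSON strings into one array lets jsonlite::fromJSON() parse a whole column in a single call:

```r
library(jsonlite)

# Two hypothetical 'device' cells as they might appear in the raw CSV
cells <- c('{"browser": "Chrome", "isMobile": false}',
           '{"browser": "Safari", "isMobile": true}')

# Wrap the rows into a single JSON array and parse it in one call;
# flatten = TRUE expands any nested objects into plain columns
parsed <- fromJSON(paste0("[", paste(cells, collapse = ","), "]"),
                   flatten = TRUE)

parsed$browser  # "Chrome" "Safari"
```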
flatten_json <- . %>%
str_c(., collapse = ",") %>%
str_c("[", ., "]") %>%
fromJSON(flatten = T)
parse <- . %>%
bind_cols(flatten_json(.$device)) %>%
bind_cols(flatten_json(.$geoNetwork)) %>%
bind_cols(flatten_json(.$trafficSource)) %>%
bind_cols(flatten_json(.$totals)) %>%
  select(-device, -geoNetwork, -trafficSource, -totals)

Let's convert the train and test sets to the tidy format:
tr <- parse(tr)
te <- parse(te)

| channelGrouping | date | fullVisitorId | sessionId | socialEngagementType | visitId | visitNumber | visitStartTime | browser | browserVersion | browserSize | operatingSystem | operatingSystemVersion | isMobile | mobileDeviceBranding | mobileDeviceModel | mobileInputSelector | mobileDeviceInfo | mobileDeviceMarketingName | flashVersion | language | screenColors | screenResolution | deviceCategory | continent | subContinent | country | region | metro | city | cityId | networkDomain | latitude | longitude | networkLocation | campaign | source | medium | keyword | isTrueDirect | referralPath | adContent | campaignCode | adwordsClickInfo.criteriaParameters | adwordsClickInfo.page | adwordsClickInfo.slot | adwordsClickInfo.gclId | adwordsClickInfo.adNetworkType | adwordsClickInfo.isVideoAd | visits | hits | pageviews | bounces | newVisits | transactionRevenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Organic Search | 20160902 | 1131660440785968503 | 1131660440785968503_1472830385 | Not Socially Engaged | 1472830385 | 1 | 1472830385 | Chrome | not available in demo dataset | not available in demo dataset | Windows | not available in demo dataset | FALSE | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | desktop | Asia | Western Asia | Turkey | Izmir | (not set) | Izmir | not available in demo dataset | ttnet.com.tr | not available in demo dataset | not available in demo dataset | not available in demo dataset | (not set) | organic | (not provided) | NA | NA | NA | NA | not available in demo dataset | NA | NA | NA | NA | NA | 1 | 1 | 1 | 1 | 1 | NA | |
| Organic Search | 20160902 | 377306020877927890 | 377306020877927890_1472880147 | Not Socially Engaged | 1472880147 | 1 | 1472880147 | Firefox | not available in demo dataset | not available in demo dataset | Macintosh | not available in demo dataset | FALSE | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | desktop | Oceania | Australasia | Australia | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | dodo.net.au | not available in demo dataset | not available in demo dataset | not available in demo dataset | (not set) | organic | (not provided) | NA | NA | NA | NA | not available in demo dataset | NA | NA | NA | NA | NA | 1 | 1 | 1 | 1 | 1 | NA |
| channelGrouping | date | fullVisitorId | sessionId | socialEngagementType | visitId | visitNumber | visitStartTime | browser | browserVersion | browserSize | operatingSystem | operatingSystemVersion | isMobile | mobileDeviceBranding | mobileDeviceModel | mobileInputSelector | mobileDeviceInfo | mobileDeviceMarketingName | flashVersion | language | screenColors | screenResolution | deviceCategory | continent | subContinent | country | region | metro | city | cityId | networkDomain | latitude | longitude | networkLocation | campaign | source | medium | keyword | isTrueDirect | referralPath | adContent | adwordsClickInfo.criteriaParameters | adwordsClickInfo.page | adwordsClickInfo.slot | adwordsClickInfo.gclId | adwordsClickInfo.adNetworkType | adwordsClickInfo.isVideoAd | visits | hits | pageviews | newVisits | bounces |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Organic Search | 20171016 | 6167871330617112363 | 6167871330617112363_1508151024 | Not Socially Engaged | 1508151024 | 2 | 1508151024 | Chrome | not available in demo dataset | not available in demo dataset | Macintosh | not available in demo dataset | FALSE | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | desktop | Asia | Southeast Asia | Singapore | (not set) | (not set) | (not set) | not available in demo dataset | myrepublic.com.sg | not available in demo dataset | not available in demo dataset | not available in demo dataset | (not set) | organic | (not provided) | TRUE | NA | NA | not available in demo dataset | NA | NA | NA | NA | NA | 1 | 4 | 4 | NA | NA | |
| Organic Search | 20171016 | 0643697640977915618 | 0643697640977915618_1508175522 | Not Socially Engaged | 1508175522 | 1 | 1508175522 | Chrome | not available in demo dataset | not available in demo dataset | Windows | not available in demo dataset | FALSE | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | not available in demo dataset | desktop | Europe | Southern Europe | Spain | Aragon | (not set) | Zaragoza | not available in demo dataset | rima-tde.net | not available in demo dataset | not available in demo dataset | not available in demo dataset | (not set) | organic | (not provided) | NA | NA | NA | not available in demo dataset | NA | NA | NA | NA | NA | 1 | 5 | 5 | 1 | NA |
| fullVisitorId | PredictedLogRevenue |
|---|---|
| 0000000259678714014 | 0 |
| 0000049363351866189 | 0 |
| 0000053049821714864 | 0 |
| 0000059488412965267 | 0 |
| 0000085840370633780 | 0 |
setdiff(names(tr), names(te))

## [1] "campaignCode"       "transactionRevenue"
tr %<>% select(-one_of("campaignCode"))

The test set lacks two columns: one is the target variable transactionRevenue; the other, campaignCode, we remove from the train set.
Let's find constant columns; all these useless features can be safely removed.
fea_uniq_values <- sapply(tr, n_distinct)
(fea_del <- names(fea_uniq_values[fea_uniq_values == 1]))

## [1] "socialEngagementType"
## [2] "browserVersion"
## [3] "browserSize"
## [4] "operatingSystemVersion"
## [5] "mobileDeviceBranding"
## [6] "mobileDeviceModel"
## [7] "mobileInputSelector"
## [8] "mobileDeviceInfo"
## [9] "mobileDeviceMarketingName"
## [10] "flashVersion"
## [11] "language"
## [12] "screenColors"
## [13] "screenResolution"
## [14] "cityId"
## [15] "latitude"
## [16] "longitude"
## [17] "networkLocation"
## [18] "adwordsClickInfo.criteriaParameters"
## [19] "visits"
tr %<>% select(-one_of(fea_del))
te %<>% select(-one_of(fea_del))

After parsing the JSON data we can observe many missing values in the dataset. Let's find out how many missing values each feature has, keeping in mind that values such as "not available in demo dataset", "(not set)", "unknown.unknown" and "(not provided)" can be treated as NA.
is_na_val <- function(x) x %in% c("not available in demo dataset", "(not provided)",
"(not set)", "<NA>", "unknown.unknown", "(none)")
tr %<>% mutate_all(funs(ifelse(is_na_val(.), NA, .)))
te %<>% mutate_all(funs(ifelse(is_na_val(.), NA, .)))

A number of features are missing almost completely.
tr %<>%
mutate(date = ymd(date),
hits = as.integer(hits),
pageviews = as.integer(pageviews),
bounces = as.integer(bounces),
newVisits = as.integer(newVisits),
transactionRevenue = as.numeric(transactionRevenue))
te %<>%
mutate(date = ymd(date),
hits = as.integer(hits),
pageviews = as.integer(pageviews),
bounces = as.integer(bounces),
     newVisits = as.integer(newVisits))

As the target variable we use transactionRevenue, which is a sub-column of the totals JSON column. This variable appears to be multiplied by \(10^6\).
y <- tr$transactionRevenue
tr$transactionRevenue <- NULL
summary(y)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
## 1.000e+04 2.493e+07 4.945e+07 1.337e+08 1.077e+08 2.313e+10 892138
We can safely replace NA values with 0.
y[is.na(y)] <- 0
summary(y)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.000e+00 0.000e+00 0.000e+00 1.704e+06 0.000e+00 2.313e+10
The target variable has a wide range of values. Its distribution is right-skewed. For modelling we will use log-transformed target.
BTW, only 1.27% of all transactions have non-zero revenue:
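That share can be computed directly from the zero-filled target vector; with toy numbers in place of the actual y:

```r
# Hypothetical zero-filled revenue vector: most sessions bring no revenue
y_toy <- c(0, 0, 0, 0, 0, 0, 12e6, 0)

# Share of sessions with non-zero revenue, in percent
round(100 * mean(y_toy > 0), 2)  # 12.5
```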
The figure below shows that users who came via Affiliates and Social channels do not generate revenue. The most profitable channel is Referral:
Also, first-visit users usually generate more total revenue, though it should be noted that most users have visited the website only once.
The revenue itself can be viewed as a time series; there seems to be a recurring pattern of peaks.
There is an interesting separation in target variable by isTrueDirect feature:
Let's see if we can predict the log-transformed mean daily revenue as a time series. Here we use the zoo and forecast packages for time-series modelling:
tr %>%
bind_cols(tibble(revenue = y)) %>%
group_by(date) %>%
summarize(mean_revenue = log1p(mean(revenue/1e6))) %>%
ungroup() %>%
with(zoo(mean_revenue, order.by = date)) ->
revenue
h <- max(te$date) - min(te$date) + 1
revenue %>%
autoplot() +
geom_line() +
geom_smooth() +
labs(x = "", y = "log(revenue)") +
  theme_minimal()

We use a simple auto.arima model. We need to forecast for a period of 272 days.
m_aa <- auto.arima(revenue)
summary(m_aa)

## Series: revenue
## ARIMA(4,0,3) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 ma3 mean
## -0.2547 -0.4445 0.3578 -0.0995 0.6029 0.5490 -0.3260 0.8924
## s.e. 0.3520 0.2285 0.2290 0.0782 0.3492 0.3354 0.3007 0.0249
##
## sigma^2 estimated as 0.1451: log likelihood=-162.27
## AIC=342.54 AICc=343.04 BIC=377.66
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0003557683 0.3766821 0.2882868 -Inf Inf 0.790288
## ACF1
## Training set -0.004423992
forecast(m_aa, h = h) %>%
autoplot() +
  theme_minimal()

Clearly, this model is of no use for forecasting over a long period. Let's add mean pageviews as a regression term:
tr %>%
group_by(date) %>%
summarize(mean_pv = log1p(mean(pageviews, na.rm=TRUE))) %>%
ungroup() %$%
mean_pv ->
mean_pv_tr
te %>%
group_by(date) %>%
summarize(mean_pv = log1p(mean(pageviews, na.rm=TRUE))) %>%
ungroup() %$%
mean_pv ->
  mean_pv_te

m_aa_reg <- auto.arima(revenue, xreg = mean_pv_tr)
summary(m_aa_reg)

## Series: revenue
## Regression with ARIMA(3,1,2) errors
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 xreg
## 0.7110 -0.2203 -0.0852 -1.3928 0.4779 2.4881
## s.e. 0.2007 0.0800 0.0688 0.1959 0.1750 0.3198
##
## sigma^2 estimated as 0.1216: log likelihood=-131.05
## AIC=276.11 AICc=276.42 BIC=303.41
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01178035 0.345297 0.2658876 -Inf Inf 0.7288843
## ACF1
## Training set -0.004276301
forecast(m_aa_reg, h = h, xreg = mean_pv_te) %>%
autoplot() +
  theme_minimal()

This forecast looks much better. It could be used as part of a more complex model.
The dataset contains the timestamp column visitStartTime expressed as POSIX time. It allows us to create a bunch of features. Let’s check ‘symmetric differences’ of the time features from the train and test sets.
tr_vst <- as_datetime(tr$visitStartTime)
te_vst <- as_datetime(te$visitStartTime)
symdiff <- function(x, y) setdiff(union(x, y), intersect(x, y))

Year:
symdiff(tr_vst %>% year %>% unique, te_vst %>% year %>% unique)

## [1] 2016 2018
Month:
symdiff(tr_vst %>% month %>% unique, te_vst %>% month %>% unique)

## [1] 6 7
Day:
symdiff(tr_vst %>% day %>% unique, te_vst %>% day %>% unique)

## integer(0)
Week:
symdiff(tr_vst %>% week %>% unique, te_vst %>% week %>% unique)

## [1] 25 24 26 20 22 27 19 28 23 30 21 29
Day of the year:
symdiff(tr_vst %>% yday %>% unique, te_vst %>% yday %>% unique)

## [1] 174 175 164 165 172 173 122 162 163 182 181 139 140 176 150 149 136
## [18] 135 213 168 169 167 366 186 185 131 132 179 178 166 195 196 157 158
## [35] 205 206 143 142 123 124 159 171 210 209 204 145 146 138 141 187 188
## [52] 203 202 126 127 152 153 134 128 129 170 161 160 156 125 184 192 193
## [69] 183 148 154 155 144 208 207 201 130 189 133 137 194 200 212 190 151
## [86] 177 211 147 191 199 197 198 180
Hour:
symdiff(tr_vst %>% hour %>% unique, te_vst %>% hour %>% unique)

## integer(0)
We can see that some time features (week, month, day of the year) differ notably between the train and test sets. Thus, they can cause overfitting, while year and hour can be useful.
Summary:
This kind of plot is useful for discovering multi-feature interactions.
The next figure shows the flows for the case when revenue > 0:
Non-zero transaction revenue mostly comes from the flow US-desktop-Chrome-{Organic Search | Referral}-net.
Some features are categorical; we re-encode them with one-hot encoding (with a reduced set of levels). The ID columns are dropped.
m <- tr %>%
dplyr::mutate(year = year(date),
month = month(date),
day = day(date),
isMobile = ifelse(isMobile, 1L, 0L),
isTrueDirect = ifelse(isTrueDirect, 1L, 0L)) %>%
mutate_all(funs(ifelse(is.na(.), 0, .))) %>%
select(-date, -fullVisitorId, -visitId, -sessionId) %>%
mutate_if(is.character, factor) %>%
mutate_if(is.factor, fct_lump, prop = 0.01) %>%
model.matrix(~ . - 1, .) %>%
cor(y) %>%
data.table::as.data.table(keep.rownames=TRUE) %>%
purrr::set_names("Feature", "rho") %>%
arrange(-rho)
m %>%
ggplot(aes(x = rho)) +
geom_histogram(bins = 50, fill="steelblue") +
labs(x = "correlation") +
  theme_minimal()

The values of the correlation coefficient are concentrated around zero, but several are larger than 0.1:
m %>%
filter(rho > 0.1) %>%
  kable()

| Feature | rho |
|---|---|
| pageviews | 0.1555894 |
| hits | 0.1543326 |
Let’s visualize the relationship of the target variable with each of the correlated variables.
Here we observe a weak positive relationship. Still, these features can play an important role in a statistical model.
To train an autoencoder we can use the h2o package:
h2o.no_progress()
h2o.init(nthreads = 4, max_mem_size = "10G")

##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\ANDY~1.YUA\AppData\Local\Temp\Rtmpq6o9O9/h2o_andy_yuan_started_from_r.out
## C:\Users\ANDY~1.YUA\AppData\Local\Temp\Rtmpq6o9O9/h2o_andy_yuan_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 7 seconds 118 milliseconds
## H2O cluster timezone: Asia/Singapore
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.8
## H2O cluster version age: 1 month and 29 days
## H2O cluster name: H2O_started_from_R_andy.yuan_xgj592
## H2O cluster total nodes: 1
## H2O cluster total memory: 8.89 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.1 (2018-07-02)
tr_h2o <- as.h2o(tr)
te_h2o <- as.h2o(te)
n_ae <- 4

Let's train a simple model that compresses the input space to 4 components:
m_ae <- h2o.deeplearning(training_frame = tr_h2o,
x = 1:ncol(tr_h2o),
autoencoder = T,
activation="Rectifier",
reproducible = TRUE,
seed = 0,
sparse = T,
standardize = TRUE,
hidden = c(32, n_ae, 32),
max_w2 = 5,
epochs = 25)
tr_ae <- h2o.deepfeatures(m_ae, tr_h2o, layer = 2) %>% as_tibble
te_ae <- h2o.deepfeatures(m_ae, te_h2o, layer = 2) %>% as_tibble
rm(tr_h2o, te_h2o, m_ae); invisible(gc())
h2o.shutdown(prompt = FALSE)## [1] TRUE
In the figures we can observe a two-class separation. This information can be useful for statistical models.
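For reference, such a scatter plot can be sketched as follows. The DF.L2.C* column names are the defaults produced by h2o.deepfeatures, but the data here are toy stand-ins, not the actual deep features:

```r
library(ggplot2)

# Toy stand-in for tr_ae plus a revenue flag (hypothetical values)
ae_toy <- data.frame(
  DF.L2.C1    = c(0.10, 0.21, 1.52, 1.70),
  DF.L2.C2    = c(0.33, 0.12, 1.24, 1.41),
  has_revenue = c(FALSE, FALSE, TRUE, TRUE)
)

# Colour points by whether the session produced revenue
p <- ggplot(ae_toy, aes(DF.L2.C1, DF.L2.C2, colour = has_revenue)) +
  geom_point() +
  theme_minimal()
```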
In this section I'd like to present a linear mixed model. The dataset in this competition is hierarchical: it contains session-level entries, but we have to predict aggregated, user-level values. We can consider rows with identical fullVisitorId as repeated measurements; thus, the grouping factor is fullVisitorId. For the LMM we will use the most important variables identified by the XGB model. There are several R packages for LMMs; I use the lme4 package. You can find a tutorial for LMM here.
tri <- 1:nrow(tr)
data <- tr %>%
bind_cols(tibble(revenue = y)) %>%
bind_rows(te) %>%
select(revenue, pageviews, hits, visitNumber, fullVisitorId) %>%
mutate_each(funs(as.numeric(.) %>% log1p), -fullVisitorId) %>%
mutate(pageviews = ifelse(is.na(pageviews), 0, pageviews))
m_lmm0 <- lmer(revenue ~ (1|fullVisitorId), data = data[tri, ])
bg_var <- summary(m_lmm0)$varcor$fullVisitorId[1]
resid_var <- attr(summary(m_lmm0)$varcor, "sc")^2
summary(m_lmm0)

## Linear mixed model fit by REML ['lmerMod']
## Formula: revenue ~ (1 | fullVisitorId)
## Data: data[tri, ]
##
## REML criterion at convergence: 3810968
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.1182 -0.1005 -0.1005 -0.0950 11.7280
##
## Random effects:
## Groups Name Variance Std.Dev.
## fullVisitorId (Intercept) 0.2302 0.4798
## Residual 3.7493 1.9363
## Number of obs: 903653, groups: fullVisitorId, 714167
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.206497 0.002154 95.86
The between-group variance is estimated as 0.2302456, and the total variance equals 3.9795885; thus the intraclass correlation coefficient (ICC) is 0.058. Such a low ICC indicates that rows within the same group do not resemble each other much.
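The ICC itself is simply the between-group variance divided by the total variance; plugging in the components reported above:

```r
# Variance components from summary(m_lmm0)
bg_var    <- 0.2302456  # between-group (fullVisitorId) variance
resid_var <- 3.7493429  # residual variance

# Intraclass correlation: share of total variance explained by grouping
icc <- bg_var / (bg_var + resid_var)
round(icc, 3)  # 0.058
```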
Let’s create more complex models:
m_lmm1 <- update(m_lmm0, revenue ~ pageviews + (1|fullVisitorId))
m_lmm2 <- update(m_lmm0, revenue ~ pageviews + hits + (1|fullVisitorId))
m_lmm3 <- update(m_lmm0, revenue ~ pageviews + hits + visitNumber + (1|fullVisitorId))
anova(m_lmm0, m_lmm1, m_lmm2, m_lmm3)

## Data: data[tri, ]
## Models:
## m_lmm0: revenue ~ (1 | fullVisitorId)
## m_lmm1: revenue ~ pageviews + (1 | fullVisitorId)
## m_lmm2: revenue ~ pageviews + hits + (1 | fullVisitorId)
## m_lmm3: revenue ~ pageviews + hits + visitNumber + (1 | fullVisitorId)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m_lmm0 3 3810963 3810999 -1905479 3810957
## m_lmm1 4 3718035 3718082 -1859013 3718027 92930.7 1 < 2.2e-16 ***
## m_lmm2 5 3716490 3716549 -1858240 3716480 1546.7 1 < 2.2e-16 ***
## m_lmm3 6 3712719 3712789 -1856353 3712707 3773.5 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pred_lmm <- predict(m_lmm3)
rm(data, m_lmm0, m_lmm1, m_lmm2, m_lmm3); invisible(gc())

The most complex model has the lowest AIC, so we will use it for predictions.
It is always useful at the beginning of a competition to create several simple models and compare them. For all models here we use a preprocessed dataset:
grp_mean <- function(x, grp) ave(x, grp, FUN = function(x) mean(x, na.rm = TRUE))
idx <- tr$date < ymd("20170701")
id <- te[, "fullVisitorId"]
tri <- 1:nrow(tr)
tr_te <- tr %>%
bind_rows(te) %>%
mutate(year = year(date) %>% factor(),
wday = wday(date) %>% factor(),
hour = hour(as_datetime(visitStartTime)) %>% factor(),
isMobile = ifelse(isMobile, 1L, 0L),
isTrueDirect = ifelse(isTrueDirect, 1L, 0L),
adwordsClickInfo.isVideoAd = ifelse(!adwordsClickInfo.isVideoAd, 0L, 1L)) %>%
select(-date, -fullVisitorId, -visitId, -sessionId, -hits, -visitStartTime) %>%
mutate_if(is.character, factor) %>%
mutate(pageviews_mean_vn = grp_mean(pageviews, visitNumber),
pageviews_mean_country = grp_mean(pageviews, country),
pageviews_mean_city = grp_mean(pageviews, city),
pageviews_mean_dom = grp_mean(pageviews, networkDomain),
pageviews_mean_ref = grp_mean(pageviews, referralPath)) %T>%
  glimpse()

## Observations: 1,708,337
## Variables: 36
## $ channelGrouping <fct> Organic Search, Organic Search,...
## $ visitNumber <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1...
## $ browser <fct> Chrome, Firefox, Chrome, UC Bro...
## $ operatingSystem <fct> Windows, Macintosh, Windows, Li...
## $ isMobile <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1...
## $ deviceCategory <fct> desktop, desktop, desktop, desk...
## $ continent <fct> Asia, Oceania, Europe, Asia, Eu...
## $ subContinent <fct> Western Asia, Australasia, Sout...
## $ country <fct> Turkey, Australia, Spain, Indon...
## $ region <fct> Izmir, NA, Community of Madrid,...
## $ metro <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ city <fct> Izmir, NA, Madrid, NA, NA, NA, ...
## $ networkDomain <fct> ttnet.com.tr, dodo.net.au, NA, ...
## $ campaign <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ source <fct> google, google, google, google,...
## $ medium <fct> organic, organic, organic, orga...
## $ keyword <fct> NA, NA, NA, google + online, NA...
## $ isTrueDirect <int> NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ referralPath <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adContent <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.page <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.slot <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.gclId <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.adNetworkType <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.isVideoAd <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ pageviews <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ bounces <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ newVisits <int> 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, ...
## $ year <fct> 2016, 2016, 2016, 2016, 2016, 2...
## $ wday <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hour <fct> 15, 5, 1, 5, 13, 9, 11, 10, 8, ...
## $ pageviews_mean_vn <dbl> 3.347945, 3.347945, 3.347945, 3...
## $ pageviews_mean_country <dbl> 1.907835, 3.137253, 2.757053, 2...
## $ pageviews_mean_city <dbl> 1.601546, 1.000000, 2.523963, 1...
## $ pageviews_mean_dom <dbl> 1.728231, 2.856089, 1.000000, 1...
## $ pageviews_mean_ref <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
rm(tr, te, tr_ae, te_ae); invisible(gc())

To make a submission file we will use this function:
submit <- . %>%
as_tibble() %>%
set_names("y") %>%
mutate(y = ifelse(y < 0, 0, expm1(y))) %>%
bind_cols(id) %>%
group_by(fullVisitorId) %>%
summarise(y = log1p(sum(y))) %>%
right_join(
read_csv("./data/sample_submission.csv"),
by = "fullVisitorId") %>%
mutate(PredictedLogRevenue = round(y, 5)) %>%
select(-y) %>%
  write_csv(sub)

For the glmnet model we need a model matrix. We replace NA values with zeros, and rare factor levels are lumped:
tr_te_ohe <- tr_te %>%
mutate_if(is.factor, fct_explicit_na) %>%
mutate_if(is.numeric, funs(ifelse(is.na(.), 0L, .))) %>%
mutate_if(is.factor, fct_lump, prop = 0.05) %>%
select(-adwordsClickInfo.isVideoAd) %>%
model.matrix(~.-1, .) %>%
scale() %>%
round(4)
X <- tr_te_ohe[tri, ]
X_test <- tr_te_ohe[-tri, ]
rm(tr_te_ohe); invisible(gc())

The next step is to create a cross-validated glmnet model (note that alpha = 0 gives ridge regression; alpha = 1 would give the LASSO):
m_glm <- cv.glmnet(X, log1p(y), alpha = 0, family="gaussian",
                    type.measure = "mse", nfolds = 5)

Finally, we create predictions with the glmnet model:
pred_glm_tr <- predict(m_glm, X, s = "lambda.min") %>% c()
pred_glm <- predict(m_glm, X_test, s = "lambda.min") %>% c()
sub <- "glmnet_gs.csv"
submit(pred_glm)
rm(m_glm); invisible(gc())

For a neural net we can use the same model matrix. Let's create a simple sequential model:
m_nn <- keras_model_sequential()
m_nn %>%
layer_dense(units = 256, activation = "relu", input_shape = ncol(X)) %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 128, activation = "relu") %>%
layer_dropout(rate = 0.25) %>%
  layer_dense(units = 1, activation = "linear")

Next, we compile the model with appropriate parameters:
m_nn %>% compile(loss = "mean_squared_error",
metrics = custom_metric("rmse", function(y_true, y_pred)
k_sqrt(metric_mean_squared_error(y_true, y_pred))),
         optimizer = optimizer_adadelta())

Then we train the model:
history <- m_nn %>%
fit(X, log1p(y),
epochs = 50,
batch_size = 128,
verbose = 0,
validation_split = 0.2,
      callbacks = callback_early_stopping(patience = 5))

And finally, predictions:
pred_nn_tr <- predict(m_nn, X) %>% c()
pred_nn <- predict(m_nn, X_test) %>% c()
sub <- "keras_gs.csv"
submit(pred_nn)
rm(m_nn, X, X_test); invisible(gc())

At last, we are ready to create a simple XGB model. First, we need to preprocess the dataset. We don't have to worry about NA values here, since XGB handles them by default:
tr_te_xgb <- tr_te %>%
mutate_if(is.factor, as.integer) %>%
  glimpse()

## Observations: 1,708,337
## Variables: 36
## $ channelGrouping <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ visitNumber <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1...
## $ browser <int> 39, 57, 39, 115, 39, 39, 39, 39...
## $ operatingSystem <int> 21, 8, 21, 7, 1, 21, 21, 21, 21...
## $ isMobile <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1...
## $ deviceCategory <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3...
## $ continent <int> 3, 5, 4, 3, 4, 4, 3, 5, 4, 4, 3...
## $ subContinent <int> 21, 1, 19, 16, 13, 19, 18, 1, 2...
## $ country <int> 211, 13, 187, 95, 218, 101, 155...
## $ region <int> 193, NA, 99, NA, NA, NA, NA, 33...
## $ metro <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ city <int> 378, NA, 475, NA, NA, NA, NA, 1...
## $ networkDomain <int> 37454, 10098, NA, NA, NA, 12548...
## $ campaign <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ source <int> 207, 207, 207, 207, 207, 207, 2...
## $ medium <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ keyword <int> NA, NA, NA, 967, NA, NA, NA, NA...
## $ isTrueDirect <int> NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ referralPath <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adContent <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.page <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.slot <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.gclId <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.adNetworkType <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.isVideoAd <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ pageviews <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ bounces <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ newVisits <int> 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, ...
## $ year <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wday <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hour <int> 16, 6, 2, 6, 14, 10, 12, 11, 9,...
## $ pageviews_mean_vn <dbl> 3.347945, 3.347945, 3.347945, 3...
## $ pageviews_mean_country <dbl> 1.907835, 3.137253, 2.757053, 2...
## $ pageviews_mean_city <dbl> 1.601546, 1.000000, 2.523963, 1...
## $ pageviews_mean_dom <dbl> 1.728231, 2.856089, 1.000000, 1...
## $ pageviews_mean_ref <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
rm(tr_te); invisible(gc())

Second, we create the train, validation and test sets using a time-based split:
dtest <- xgb.DMatrix(data = data.matrix(tr_te_xgb[-tri, ]))
tr_te_xgb <- tr_te_xgb[tri, ]
dtr <- xgb.DMatrix(data = data.matrix(tr_te_xgb[idx, ]), label = log1p(y[idx]))
dval <- xgb.DMatrix(data = data.matrix(tr_te_xgb[!idx, ]), label = log1p(y[!idx]))
dtrain <- xgb.DMatrix(data = data.matrix(tr_te_xgb), label = log1p(y))
cols <- colnames(tr_te_xgb)
rm(tr_te_xgb); invisible(gc())

The next step is to train the model:
p <- list(objective = "reg:linear",
booster = "gbtree",
eval_metric = "rmse",
nthread = 4,
eta = 0.05,
max_depth = 7,
min_child_weight = 5,
gamma = 0,
subsample = 0.8,
colsample_bytree = 0.7,
colsample_bylevel = 0.6,
nrounds = 2000)
set.seed(0)
m_xgb <- xgb.train(p, dtr, p$nrounds, list(val = dval), print_every_n = 100, early_stopping_rounds = 100)

## [1] val-rmse:2.099455
## Will train until val_rmse hasn't improved in 100 rounds.
##
## [101] val-rmse:1.699214
## [201] val-rmse:1.699873
## Stopping. Best iteration:
## [102] val-rmse:1.699087
xgb.importance(cols, model = m_xgb) %>%
  xgb.plot.importance(top_n = 25)

Finally, we make predictions:
pred_xgb_tr <- predict(m_xgb, dtrain)
pred_xgb <- predict(m_xgb, dtest)
sub <- "xgb_gs.csv"
submit(pred_xgb)
rm(dtr, dtrain, dval, dtest, m_xgb); invisible(gc())

Let's compare predictions for the train set:
As we can see, the distributions of the predictions are quite different. The XGB model tends to produce a narrower interval, closer to the true distribution.
Average prediction of the glm/keras/xgboost models:
pred_avg <- log1p((expm1(pred_glm) + expm1(pred_nn) + expm1(pred_xgb)) / 3)
sub <- "avg_gs.csv"
submit(pred_avg)

The distributions of the predictions for the test set also differ considerably; nevertheless, after proper tuning, the models can be useful for ensembling.